Application aware input/output fencing

ABSTRACT

Disclosed herein are methods, systems, and processes to perform application aware input/output (I/O) fencing operations. Performing such an application aware I/O fencing operation includes installing, on coordination points, an identifier that associates an instance of an application with a node on which the instance of the application is executing. A weight assigned to the instance of the application is determined, and the instance of the application is terminated based on the weight.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims benefit under 35 U.S.C. §119(a) of pending Indian Patent Application No. 201621022438, filed in India on Jun. 30, 2016, entitled “Application Aware Input/Output Fencing,” and having Jai Gahlot and Abhijit Toley as inventors. The above-referenced application is incorporated by reference herein, in its entirety and for all purposes.

FIELD OF THE DISCLOSURE

This disclosure relates to distributed storage in computing clusters. In particular, this disclosure relates to performing application-aware fencing operations in such clusters.

DESCRIPTION OF THE RELATED ART

Modern companies and organizations provide a variety of online services for their employees, customers, and users. Providing such services requires a variety of software applications (e.g., a database management system (DBMS), and the like). For example, a business may implement a database with pertinent information necessary for an e-commerce transaction where Extract, Transform, and Load (ETL) processes are used to extract data from the database, transform the data for storing (e.g., for querying, analysis, and the like), and load the data for utilization (e.g., into a data store, a data warehouse, and the like).

Various applications can be used to perform individual tasks of ETL processes. For example, an extract application can extract data from the database, a transform application can change the format of the extracted data, and a load application can load the transformed data into a data store. These different applications can be configured to run on multiple nodes (or computing devices) that are part of a cluster.

A cluster is a distributed computing system with several nodes that work together to provide processing power and storage resources by spreading processing load over more than one node, thereby eliminating or at least minimizing single points of failure. Therefore, different applications running on multiple nodes can continue to function despite a problem with one node (or computing device) in the cluster.

“Split-brain” refers to a condition (or situation) where the availability of data (e.g., from shared storage) is inconsistent due to maintenance of separate data sets that overlap in scope. For example, such overlap can potentially occur because of a network partition where sub-clusters are unable to communicate with each other to synchronize their respective data sets. The data sets of each sub-cluster (or network partition) may randomly serve clients by their own idiosyncratic data set updates, without coordination with other data sets from other sub-clusters. Therefore, when a split-brain condition occurs in a cluster, the decision of which sub-cluster should continue to operate (called a partition arbitration process, or simply arbitration) can be made by performing fencing operations using coordination points.

Input/output (I/O) fencing (or simply, fencing) refers to the process of isolating a node of a cluster, and/or protecting shared resources of the cluster when the node malfunctions (or appears to malfunction). Because a cluster has multiple nodes, there is a likelihood that one of the nodes may fail at some point. The failed node may have control over shared resources such as shared storage used and required by the other nodes in the cluster. A cluster must be capable of taking corrective action when a node fails, because as noted earlier, data corruption can occur if two nodes in different sub-clusters or network partitions attempt to take control of shared storage in an uncoordinated manner. Therefore, a fencing operation results in the fencing-off (or termination) of one or more nodes in the cluster.

Coordination points can be implemented in a cluster to assist with fencing operations. Coordination points are computing devices that provide a lock mechanism to determine which node (or nodes) are allowed to fence off shared storage (e.g., data drives) from other nodes in the cluster. For example, a node must eject (or uninstall) a registration key of a peer node from a coordination point before that node is allowed to fence the peer node from shared storage.
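The lock mechanism described above can be illustrated with a minimal sketch. The class CoordinationPoint and the function may_fence are hypothetical names introduced only for illustration. The sketch further assumes that a node can eject a peer's key only while its own key is still registered, and that a node is allowed to fence a peer once it has ejected the peer's key from a majority of coordination points; neither assumption is stated in this disclosure.

```python
class CoordinationPoint:
    """Hypothetical in-memory stand-in for a coordination point."""

    def __init__(self):
        self.keys = set()  # registration keys currently installed

    def register(self, node):
        self.keys.add(node)

    def eject(self, ejector, victim):
        # Assumption: only a node whose own key is still registered
        # may eject the key of a peer node.
        if ejector not in self.keys:
            return False
        self.keys.discard(victim)
        return True


def may_fence(coordination_points, ejector, victim):
    """Return True if ejector ejected victim's key from a majority of points.

    The majority rule is an assumption made for this sketch.
    """
    wins = sum(cp.eject(ejector, victim) for cp in coordination_points)
    return wins > len(coordination_points) // 2
```

In this sketch, once a node has ejected a peer's key from the coordination points, a later attempt by the peer to eject the first node's key fails, so only one of the two nodes is permitted to fence.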

As previously noted, different applications can be configured to run on multiple nodes that are part of a cluster (e.g., in an ETL workload environment). In such environments, a network partition, as described above, does not necessarily result in a split-brain condition. For example, if all nodes on which a given application is running are in the same network partition, no split-brain condition exists (e.g., there is no risk of uncoordinated access to data).

However, a partitioned cluster can experience a split-brain condition even if there is no node failure (or a risk of node failure). For example, an “application split-brain condition” can be caused by uncoordinated access to data by various instances of an application that are running on separate sub-clusters of a partitioned cluster. For example, if a cluster is partitioned into two separate partitions, two instances of the same application (e.g., instances A and B of an application) running on the two separate partitions can cause a split-brain condition because each instance of the application can attempt to take control of shared storage in an uncoordinated manner, thus giving rise to a risk of data corruption.

If a traditional fencing solution (e.g., as described above) is implemented, the node on which instance A or instance B of the application is running is terminated as part of a fencing operation. Therefore, under a traditional fencing paradigm, nodes in all but one network partition of a cluster are terminated. Unfortunately, such a result compromises the availability of the cluster because a traditional fencing operation results in the termination of healthy nodes in a sub-cluster even if there is no split-brain condition or if there is an application-induced split-brain condition (e.g., as described above). These healthy nodes can be utilized for other computing purposes. Therefore, terminating healthy nodes under such circumstances is redundant, undesirable, and negatively affects cluster availability.

SUMMARY OF THE DISCLOSURE

Disclosed herein are various systems, methods, and processes to perform application-aware input/output (I/O) fencing operations. One such method involves determining that an instance of an application is executing on a node. The node is one of multiple nodes that are part of a cluster. In response to the determination that the instance of the application is executing on the node, the method generates an identifier for the instance of the application that associates the instance of the application and the node on which the instance of the application is executing. The method then installs the identifier on coordination point(s).

In one embodiment, the method determines whether instances of other applications are executing on the node. In response to the determination that instances of other applications are executing on the node, the method generates other identifiers for instances of other applications that associate each of the instances and the node. The method then installs (or registers) the other identifiers on the coordination point(s).

In some embodiments, the identifier is a registration key and the other identifiers are other registration keys. In this example, the method can generate a coordination point registration key matrix that includes multiple registration keys (e.g., the registration key and the other registration keys) that are stored on the coordination point(s). The coordination point registration key matrix is maintained on the coordination point(s).
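As a minimal sketch of the registration keys and the coordination point registration key matrix described above, the following example builds one key per application instance and records the full set of keys for each coordination point. The key format ("node:application"), the function names make_key and build_key_matrix, and the node, application, and coordination point names are all hypothetical and are chosen only for illustration.

```python
def make_key(node_id, app_name):
    # Hypothetical key format that associates an application instance
    # with the node on which it executes.
    return f"{node_id}:{app_name}"


def build_key_matrix(running, coordination_points):
    """Build a mapping of coordination point -> installed registration keys.

    running: dict mapping a node id to the applications executing on it.
    coordination_points: iterable of coordination point names.
    """
    keys = sorted(
        make_key(node, app) for node, apps in running.items() for app in apps
    )
    # Every coordination point holds the same set of keys in this sketch.
    return {cp: list(keys) for cp in coordination_points}


running = {"node1": ["app140", "app145"], "node2": ["app140"]}
matrix = build_key_matrix(running, ["cp1", "cp2"])
```

Here, each coordination point ends up holding three keys: two for the instances on the first node and one for the instance on the second node.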

In other embodiments, the nodes are communicatively coupled to each other, and the coordination point(s) are communicatively coupled to the nodes but are not part of the cluster. The instance of the application and the one or more instances of the other applications are part of multiple application instances. Each application instance of the multiple application instances executes on one or more nodes. The application instances include multiple disparate application instances, including, but not limited to, disparate application instances that can be used to perform multiple Extract, Transform, and Load (ETL) processes.

In certain embodiments, the method receives an application weight matrix that includes a weight assigned to each application, a total application weight, and a total node weight. In this example, the application weight matrix is transmitted to each node that is communicatively coupled to the node.
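One possible layout of such an application weight matrix is sketched below. The layout is an assumption made for illustration: rows are nodes, columns are applications, a cell holds the weight of the application if an instance executes on that node (and zero otherwise), the total node weight is taken to be a row sum, and the total application weight is taken to be a column sum. The function name build_awm and the sample weights are hypothetical.

```python
def build_awm(app_weights, placement):
    """Build a hypothetical application weight matrix.

    app_weights: dict mapping an application to its assigned weight.
    placement: dict mapping a node to the set of applications it executes.
    """
    cells = {
        node: {app: (w if app in apps else 0) for app, w in app_weights.items()}
        for node, apps in placement.items()
    }
    # Assumed meaning: weight of all application instances on a node.
    total_node_weight = {node: sum(row.values()) for node, row in cells.items()}
    # Assumed meaning: weight of all instances of an application in the cluster.
    total_application_weight = {
        app: sum(row[app] for row in cells.values()) for app in app_weights
    }
    return {
        "cells": cells,
        "total_node_weight": total_node_weight,
        "total_application_weight": total_application_weight,
    }


awm = build_awm(
    {"app140": 2, "app145": 3},
    {"node1": {"app140", "app145"}, "node2": {"app140"}},
)
```

Because the matrix is a plain data structure in this sketch, it can be serialized and transmitted unchanged to every node.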

In one embodiment, the method determines whether the cluster is partitioned into network partitions, and determines whether a split-brain condition exists in the cluster as a result of the partitioning. In this example, the split-brain condition is caused by one or more application instances executing on one or more nodes in the cluster. The method performs an application fencing operation to rectify the (application) split-brain condition by accessing the application weight matrix, and performing a partition arbitration process. In some embodiments, as part of performing the application fencing operation, the method uninstalls (or ejects) registration key(s) of application instance(s) from coordination point(s) based on a result of the partition arbitration process. The uninstalling causes the termination of application instance(s) instead of node(s) on which the application instance(s) are executing.

In some embodiments, the method performs an application fencing operation by installing an identifier on one or more coordination points. In this example, the identifier associates an instance of the application with a node on which the instance of the application is executing. The method then determines a weight assigned to the instance of the application, and terminates the instance of the application based, at least in part, on the weight.

In other embodiments, as part of performing the application fencing operation, the method causes termination of the instance of the application instead of the node on which the instance of the application is executing. The method accesses an application weight matrix that includes the weight assigned to the instance of the application. The method receives the application weight matrix, generates the identifier for the instance of the application (e.g., a registration key), and as part of the installing of the identifier, stores the registration key on one or more coordination points. In this example, the coordination points include one or more coordinator disks or one or more coordination point servers.

In certain embodiments, the method determines whether a cluster is partitioned into network partitions, accesses the application weight matrix, and performs a partition arbitration process using the application weight matrix. The partition arbitration process includes a fencing race to determine a winner partition and one (or more) loser partitions. In this example, the method excludes the instance of the application and other instances of the application from the fencing race, if both the instance of the application and other instances of the application execute on nodes that are part of a same network partition. Conversely, the method includes the instance of the application and other instances of the application in the fencing race, if the instance of the application and other instances of the application execute on separate network partitions.
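The inclusion and exclusion rule described above can be sketched as follows. The function name apps_in_fencing_race and the partition, node, and application names are hypothetical; the sketch assumes that each node already knows which network partition every instance of an application belongs to.

```python
def apps_in_fencing_race(instances):
    """Return the applications whose instances span more than one partition.

    instances: dict mapping an application to a dict of node -> partition
    for every node on which an instance of that application executes.
    """
    return sorted(
        app
        for app, nodes in instances.items()
        if len(set(nodes.values())) > 1  # instances on separate partitions
    )


instances = {
    # All instances in the same partition: excluded from the race.
    "app140": {"node1": "partition_a", "node2": "partition_a"},
    # Instances on separate partitions: included in the race.
    "app145": {"node3": "partition_a", "node4": "partition_b"},
}
racing = apps_in_fencing_race(instances)
```

In this sketch only the second application takes part in the fencing race, because only its instances are split across network partitions.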

In some embodiments, the fencing race to determine the winner partition and loser partition(s) is decided based on information in the application weight matrix. In this example, and as part of performing the fencing race, the method uninstalls or ejects the registration key for the instance of the application from a coordination point based on the weight assigned to the instance of the application.

In one embodiment, the method determines that a cluster has been partitioned. In this example the cluster includes multiple nodes, and as a result of the partitioning, the nodes are split between a first network partition that includes a first set of nodes and a second network partition that includes a second set of nodes. The method determines that instances of an application are executing on the first set of nodes and the second set of nodes. The method then performs an application fencing operation that causes termination of instances of the application executing on the first set of nodes or on the second set of nodes.

In some embodiments, the method performs a fencing race by accessing an application weight matrix that includes a weight assigned to the application. The method then compares a first total application weight of the first set of nodes and a second total application weight of the second set of nodes. The method bypasses the fencing race, if all instances of the application are executing on the first set of nodes in the first network partition or on the second set of nodes in the second network partition, and broadcasts a message to one or more nodes on which one or more remaining instances of the application are executing.

In other embodiments, the method determines whether the first total application weight of instances of the application executing on the first set of nodes in the first network partition is greater than the second total application weight of instances of the application executing on the second set of nodes in the second partition.

In certain embodiments, performing the fencing race further includes uninstalling a registration key associated with each instance of the application executing on the second set of nodes in the second partition from one or more coordination points. The method determines whether the first total application weight of instances of the application executing on the first set of nodes in the first network partition is less than the second total application weight of instances of the application executing on the second set of nodes in the second partition, and based on the determining, performs the fencing race after a delay. In this example, the delay is based on a time required for a second racer node that is part of the second set of nodes to perform another fencing race.
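The weight comparison and delay described above can be sketched as follows. The function names race_delay and eject_loser_keys are hypothetical, registration keys are modeled as (node, application) tuples purely for illustration, and the handling of equal weights (no delay) is an assumption, since a tie is not addressed in this summary.

```python
def race_delay(first_total, second_total, peer_race_seconds):
    """Return how long a racer in the first partition waits before racing.

    A partition with the lesser total application weight delays its race
    by the time the other partition's racer node needs to finish racing.
    """
    if first_total < second_total:
        return peer_race_seconds
    return 0.0  # Assumption: equal or greater weight races immediately.


def eject_loser_keys(keys, app, loser_nodes):
    """Return the keys left on a coordination point after the race.

    Only keys of the given application on loser nodes are uninstalled;
    keys of other applications on those nodes stay in place.
    """
    return {
        (node, a) for node, a in keys if not (a == app and node in loser_nodes)
    }


keys = {("node1", "app140"), ("node4", "app140"), ("node4", "app145")}
remaining = eject_loser_keys(keys, "app140", {"node4"})
```

In this sketch, the key for the second application on the loser node remains installed, which reflects the intent that the application instance is terminated rather than the node on which it executes.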

The foregoing is a summary and thus contains, by necessity, simplifications, generalizations and omissions of detail; consequently those skilled in the art will appreciate that the summary is illustrative only and is not intended to be in any way limiting. Other aspects, inventive features, and advantages of the present invention, as defined solely by the claims, will become apparent in the non-limiting detailed description set forth below.

BRIEF DESCRIPTION OF THE DRAWINGS

The present disclosure may be better understood, and its numerous objects, features and advantages made apparent to those skilled in the art by referencing the accompanying drawings.

FIG. 1A is a block diagram of a computing system that can perform application fencing operations, according to one embodiment of the present disclosure.

FIG. 1B is a block diagram of nodes in a cluster that can perform application fencing operations, according to one embodiment of the present disclosure.

FIG. 2A is a block diagram of a partitioned cluster that does not experience a split-brain condition, according to one embodiment of the present disclosure.

FIG. 2B is a block diagram of a partitioned cluster that experiences an application-induced split-brain condition, according to one embodiment of the present disclosure.

FIG. 3 is a table illustrating an application weight matrix, according to one embodiment of the present disclosure.

FIG. 4 is a table illustrating a coordination point registration key matrix, according to one embodiment of the present disclosure.

FIG. 5 is a block diagram of a computing system that registers application-aware registration keys on coordination points, according to one embodiment of the present disclosure.

FIG. 6 is a block diagram of nodes that store application-aware registration keys on coordination points, according to one embodiment of the present disclosure.

FIG. 7 is a block diagram of racer nodes that perform application-aware partition arbitration, according to one embodiment of the present disclosure.

FIG. 8A is a flowchart of a process for receiving an application weight matrix, according to one embodiment of the present disclosure.

FIG. 8B is a flowchart of a process for generating application-aware registration keys, according to one embodiment of the present disclosure.

FIG. 9A is a flowchart of a process for installing application-aware registration keys on coordination points, according to one embodiment of the present disclosure.

FIG. 9B is a flowchart of a process for generating a coordination point registration key matrix, according to one embodiment of the present disclosure.

FIG. 10 is a flowchart of a process for performing an application fencing operation, according to one embodiment of the present disclosure.

FIG. 11 is a flowchart of a process for uninstalling application-aware registration keys from coordination points, according to one embodiment of the present disclosure.

FIG. 12 is a flowchart of a process for performing an application fencing operation, according to one embodiment of the present disclosure.

FIG. 13 is a flowchart of a process for performing an application fencing operation, according to one embodiment of the present disclosure.

FIG. 14 is a flowchart of a process for performing an application fencing operation, according to one embodiment of the present disclosure.

FIG. 15 is a block diagram of a computing system, illustrating how a fencing module can be implemented in software, according to one embodiment of the present disclosure.

FIG. 16 is a block diagram of a networked system, illustrating how various computing devices can communicate via a network, according to one embodiment of the present disclosure.

While the disclosure is susceptible to various modifications and alternative forms, specific embodiments of the disclosure are provided as examples in the drawings and detailed description. It should be understood that the drawings and detailed description are not intended to limit the disclosure to the particular form disclosed. Instead, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the disclosure as defined by the appended claims.

DETAILED DESCRIPTION

Introduction

Because modern businesses, companies, and/or organizations increasingly rely on complex computer systems (e.g., distributed storage and/or computing systems) for their daily operations, managing the vast amount of data generated and processed by applications executing on such computer systems is a significant challenge. Various applications are typically used to manage large quantities of data stored on different types of storage devices across various networks and operating system platforms. To efficiently manage data in distributed storage and/or computing systems, Storage Area Networks (SANs) including many different types of storage devices can be implemented. SANs provide a variety of topologies and capabilities for interconnecting storage devices, subsystems, and server systems. For example, a variety of interconnect entities, such as switches, hubs, and bridges, can be used to interconnect these components.

As previously noted, a cluster includes multiple interconnected computers that appear as one computer to end users and applications. Each interconnected computer in the cluster is called a node. The combined processing power of multiple nodes can provide greater throughput and scalability than is available from a single node. In high-availability clusters, multiple nodes can execute instances of the same application and/or instances of different applications. These nodes can share a storage device for the purpose of data storage, replication and/or deduplication. A shared storage disk/device (e.g., a Cluster Shared Volume (CSV)) can be made accessible for read and write operations by various nodes and applications within a cluster. Each cluster can have multiple CSVs. In Flexible Shared Storage (FSS) systems, multiple nodes in a cluster share one or more CSVs. Thus, FSS systems enable cluster-wide network sharing of local storage (e.g., in the form of Direct Attached Storage (DAS), internal disk drives, and the like). Also as previously noted, the network sharing of storage can be enabled through the use of a network interconnect among the nodes of the cluster. This feature allows network shared storage to co-exist with physically shared storage. Therefore, distributed storage systems can be implemented in a multi-node cluster to provide high availability of data from one or more storage devices.

One known problem in clusters occurs when one or more nodes of the cluster erroneously believes that other node(s) are either not functioning properly or have left the cluster. This “split-brain” condition results in the effective partitioning of the cluster into two or more sub-clusters (also called “network partitions”). Causes of a split-brain condition include, among other reasons, failure of the communication channels between nodes, and the processing load on one node causing an excessive delay in the normal sequence of communication among nodes (e.g., one node fails to transmit its heartbeat signal for an excessive period of time).

In addition, and as noted above, a partitioned cluster can experience an “application split-brain” condition that can be caused by uncoordinated access to data by various instances of an application that are executing on separate sub-clusters of the partitioned cluster. For example, if a cluster is partitioned into two separate network partitions, two instances of the same application (e.g., instances A and B of an application) running on the two separate network partitions can cause an application-induced split-brain condition because each instance of the application can attempt to take control of shared storage in an uncoordinated manner, thus giving rise to a risk of data corruption.

For example, if a cluster is configured for a failover operation with an application instance executing on a first node, and another instance of the application executing on a second node existing in the cluster is to take over for the first node should it fail, then complete failure of a network would lead the second node to conclude that the first node has failed. The other instance of the application executing on the second node then begins operations even though the first node has not in fact failed.

Thus, the potential exists for the instance of the application executing on the first node and the other instance of the application executing on the second node to attempt to write data to the same portion (or area) of one of the storage devices in the distributed storage system thereby causing data corruption. The traditional solution is to ensure that one of the nodes cannot access the shared storage, and as noted above, input/output fencing (or more simply, just fencing) can be implemented to “fence off” the node from the shared storage.

In the event that communication between the nodes fails, such as when a portion of the network fails during a network partition, each of two or more sub-clusters of nodes can determine that the other sub-cluster of nodes has failed (or might have failed). For example, a race (also called a “fencing race”) can occur between the two (or more) sub-clusters of nodes, with control modules of each sub-cluster of nodes determining that the other sub-cluster of nodes is malfunctioning. As previously noted, an instance of an application executing on node(s) in a first sub-cluster (or network partition) can perform data writes to the storage device(s), while another instance of the application executing on node(s) in the other sub-cluster (or other network partition) can also perform data writes to the same portion(s) of the shared storage devices, resulting in data inconsistency errors. In order to prevent these data inconsistency errors, an “arbitration process” is performed that determines winner and loser sub-clusters (or groups).

Nodes in the winner sub-cluster are determined to keep communicating with the shared storage, whereas nodes in the loser sub-cluster are determined to stop communicating with these shared storage devices. However, nodes in the winner sub-cluster do not determine if or when nodes in the loser sub-cluster(s) will conclude that they have lost arbitration (and thus desist from using the shared storage devices). Thus, in addition to this arbitration process, the control module of the winner sub-cluster of node(s) can perform a fencing process that fences nodes from the loser sub-cluster(s) from the rest of the distributed storage system. The fencing process is performed to ensure that nodes from the loser sub-cluster(s) do not communicate with the storage devices, as improper data writes from the loser nodes would cause various data inconsistency and other errors.

Generally, fencing can be performed in two stages. In the first stage, fencing attempts to fence out the loser sub-cluster(s) of nodes. The loser nodes can be notified about the outcome of the arbitration and/or about the fencing process being implemented. Such notification can include the control module(s) of the loser sub-cluster of nodes discovering that the loser sub-cluster lost the arbitration process, the control module(s) of the loser sub-cluster of nodes encountering communication errors that indicate loss of communication, and/or communication from the loser nodes being disabled, among others.

In the second stage, fencing removes loser nodes' access to the storage devices, such as by instructing the shared storage devices (that are accessible to the winner sub-cluster) to not accept any communication from the loser nodes. In this case, even if the control module(s) of the winner sub-cluster of nodes cannot ensure that loser sub-cluster(s) of nodes are no longer performing data writes to the shared storage devices (such as by executing instance(s) of a shared application), the loser nodes will not be able to access/modify application data being used by winner nodes. In effect, this fencing mechanism prevents a portion of the cluster from accessing the shared storage devices in an uncoordinated manner.

When performing fencing, it is preferable to use separate computing devices that enable access to shared storage by multiple nodes, and simultaneously block access (to shared storage) by other nodes. Using such separate (and independent) computing devices adds resilience to a distributed storage system during fencing operations by providing additional arbitration mechanisms that integrate seamlessly with existing fencing software running on nodes in a cluster. In addition, such separate computing devices act (or function) as intermediary devices that are dedicated to performing (and managing) fencing operations, thus improving the speed and efficiency of the distributed storage system.

Coordination points (CPs) can be implemented in a cluster as the above-mentioned separate computing devices to assist with fencing operations. Coordination points provide a lock mechanism to determine which node (or nodes) are allowed to fence off shared storage (e.g., data drives) from other nodes in the cluster. In addition, coordination points are dedicated devices that enable access to shared storage for multiple nodes, and simultaneously block access (to shared storage) for other nodes (in a cluster). In high-availability clusters, servers, disks, interconnects, and/or other hardware and/or software appliances can be implemented (or used) as coordination points (external to the cluster) to ensure data integrity in case of loss of hardware and/or software components in the cluster. Therefore, coordination points are vital in providing data protection and maintaining high availability in a cluster.

If a traditional fencing solution (e.g., as described above) is implemented, the node on which instance A or instance B of the application is running is terminated as part of a fencing operation. Therefore, under a traditional fencing paradigm, nodes in all but one network partition of a cluster are terminated. Unfortunately, such a result compromises the availability of the cluster because a traditional fencing operation results in the termination of healthy nodes in a sub-cluster even if there is no split-brain condition or if there is an application-induced split-brain condition (e.g., as described above). These healthy nodes can be utilized for other computing purposes. Therefore, terminating healthy nodes under such circumstances is redundant, undesirable, and negatively affects cluster availability.

Described herein are methods, systems, and processes to perform application fencing operations by causing the termination of an instance of an application (e.g., if that instance is responsible for an application-induced split-brain condition) instead of terminating the node on which that instance of the application is executing.

An Example Computing System to Perform Application-Aware I/O Fencing

FIG. 1A is a block diagram of a computing system that performs application I/O fencing operations, according to one embodiment. FIG. 1A includes a configuration system 105 and a cluster 120, communicatively coupled to each other via a network 115. Multiple nodes execute in cluster 120 (e.g., nodes 125(1)-(N)). Configuration system 105 and nodes 125(1)-(N) can be any type of computing device including a server, a desktop, a laptop, a tablet, and the like. Configuration system 105 includes a configuration file 110. As noted, cluster 120 implements and executes nodes 125(1)-(N). As shown in FIG. 1A, node 125(1) includes a processor 130 and a memory 135. Memory 135 implements several applications (e.g., applications 140, 145, 150, and 155). Memory 135 also includes configuration file 110, which further includes an application weight matrix (AWM) 160. Memory 135 also implements an input/output (I/O) fencing application 165 (or simply fencing application 165) with an I/O fencing driver 170.

Configuration file 110 is generated by configuration system 105 and can be transmitted to node 125(1) via network 115. Configuration file 110 contains information regarding coordination points as well as AWM 160. For example, configuration file 110 can identify the coordination points implemented in the computing system of FIG. 1A (not shown), and can include information regarding the total number of coordination points as well. For example, configuration file 110 can identify a total of three coordination points that are implemented in a distributed computing system. In one embodiment, configuration file 110 can be created by an administrator and/or user of configuration system 105. Once generated by configuration system 105, configuration file 110 with AWM 160 can be transmitted to node 125(1) and can be used by node 125(1) to perform application fencing operations.

FIG. 1B is a block diagram of a distributed computing system that performs application-aware I/O fencing, according to one embodiment. As shown in FIG. 1B, cluster 120 includes nodes 125(1)-(3). Node 125(1) executes application 140(1) and includes configuration file 110(1) with AWM 160(1), and a fencing module 175(1). Similarly, node 125(2) executes application 140(2) (e.g., a second instance of application 140) and includes configuration file 110(2) with AWM 160(2), and fencing module 175(2). However, node 125(3) executes application 145(1) (a different application instance), but like nodes 125(1) and 125(2), includes a configuration file, an AWM, and a fencing module (e.g., configuration file 110(3) with AWM 160(3), and fencing module 175(3)). A fencing module (also called a “fencing control unit”) can be implemented on each node (e.g., by configuration system 105). In some embodiments, the fencing module can be a kernel module. Fencing modules 175(1)-(3) (or fencing control units 175(1)-(3)) are responsible for ensuring valid and current cluster membership (or membership change) through membership arbitration (e.g., the arbitration process as described above).

In some embodiments, fencing modules 175(1)-(3) also register nodes 125(1)-(3) as well as instance(s) of application(s) executing on nodes 125(1)-(3) with coordination points (CPs) 180(1)-(N). For example, fencing module 175(1) can place (or install/register) an application-aware registration key identifying node 125(1) and one or more instances of applications executing on node 125(1) on coordination points 180(1)-(N) using AWM 160(1). Similarly, fencing modules 175(2) and 175(3) each place (or install/register) an application-aware registration key identifying nodes 125(2) and 125(3) and instance(s) of application(s) executing on nodes 125(2) and 125(3) on coordination points 180(1)-(N) using AWMs 160(2) and 160(3), respectively. Therefore, registration keys 185(1)-(N) are application-aware registration keys of nodes 125(1)-(3). It will be appreciated that as used herein, the term “registration key” refers to an “application-aware registration key” as described above (e.g., an association between a node and instance(s) of application(s) executing on that node).

As shown, FIG. 1B also includes a storage area network (SAN) 190 which implements data disks 195(1)-(N). SAN 190, coordination points 180(1)-(N) and nodes 125(1)-(3) are communicatively coupled to each other via network 115. It should be noted that configuration files 110(1)-(3) received from configuration system 105 include the same information (e.g., AWMs 160(1), 160(2) and 160(3), respectively).

An Example of a Network Partitioning Event that does not Cause Split-Brain

FIG. 2A is a block diagram of a partitioned cluster that does not experience a split-brain condition, according to one embodiment. Cluster 120 is partitioned into two sub-clusters (e.g., partition 210 and partition 220) using different communication channels. Partition 210 includes nodes 125(1)-(3) and partition 220 includes nodes 125(4)-(6). Node 125(1) executes application 140(1), node 125(2) executes application 140(2), and node 125(3) executes application 140(3). Node 125(4) executes application 145(1), node 125(5) executes application 145(2), and node 125(6) executes application 145(3). Applications 140 and 145 are separate and distinct applications, instances of which execute in partitions 210 and 220, respectively.

Therefore, if cluster 120 is partitioned, there is no split-brain condition because partitions 210 and 220 are running (or executing) separate and independent applications (e.g., applications 140 and 145). Performing a traditional fencing operation in this scenario would result in partition 220 being ejected out of cluster 120 and the termination of nodes 125(4)-(6) even though there is no split-brain condition (e.g., there is no risk that different instances of one application will perform I/O operations to data disks 195(1)-(N) upon cluster partition). Therefore, under a traditional fencing paradigm, cluster 120 would lose healthy nodes (e.g., nodes 125(4)-(6)) and would incur an unnecessary and redundant failover of application 145. In this situation, it would be desirable for both partitions 210 and 220 to continue operating as there is no potential data corruption.

An Example of a Network Partitioning Event that Causes Application-Induced Split-Brain

FIG. 2B is a block diagram of a partitioned cluster that experiences an application-induced split-brain condition, according to one embodiment. As shown in FIG. 2B, cluster 120 is partitioned into two sub-clusters (e.g., partitions 210 and 220). Partition 210 includes nodes 125(1)-(3), and partition 220 includes nodes 125(4)-(6). Node 125(1) executes application 140(1), node 125(2) executes application 140(2), and node 125(3) executes application 145(1). Similarly, node 125(4) executes application 145(2), node 125(5) executes application 150(1), and node 125(6) executes application 150(2). Applications 140, 145, and 150 are separate and distinct applications.

As shown in FIG. 2B, all instances of application 140 (e.g., applications 140(1) and 140(2)) execute entirely in partition 210, and all instances of application 150 (e.g., applications 150(1) and 150(2)) execute entirely in partition 220. Therefore, there is no split-brain condition experienced by cluster 120 as a result of applications 140 and 150. However, different instances of application 145 (e.g., applications 145(1) and 145(2)) execute on separate sub-clusters as a result of cluster partitioning. For example, application 145(1) executes on partition 210 and application 145(2) executes on partition 220. Therefore, in this scenario, cluster 120 experiences an application-induced split-brain condition caused by application 145 because there is a risk of data corruption that can be caused by different instances of application 145 (e.g., applications 145(1) and 145(2)) performing I/O operations at the same time.

Performing a traditional fencing operation under such circumstances would result in the ejection of partition 220 out of cluster 120, termination of nodes 125(4)-(6) (e.g., as shown by dotted lines in FIG. 2B), and an unnecessary and redundant failover of instances of application 150 (e.g., applications 150(1) and 150(2)) when only the failover of application 145 (e.g., application 145(2)) is required to rectify the split-brain condition in cluster 120. In this situation, it would be desirable for nodes 125(5) and 125(6) to continue operating in partition 220 as there is no potential for data corruption to be caused by applications 150(1) and 150(2).

Therefore, and as noted above, performing traditional fencing operations in distributed computing systems that implement multiple instances of disparate applications results in at least two shortcomings. First, a traditional fencing operation results in the termination of healthy nodes in a cluster even if there is no split-brain condition (e.g., as shown in FIG. 2A). Second, a traditional fencing operation results in termination of healthy nodes in a cluster even if instance(s) of application(s) executing on those (healthy) nodes are not responsible for an application-induced split-brain condition (e.g., as shown in FIG. 2B).

An Example of an Application Weight Matrix

FIG. 3 is a block diagram of an application weight matrix (AWM), according to one embodiment. As noted above, an AWM can be generated by a system administrator and can be transmitted to one or more nodes as part of a configuration file. Each node in a cluster can maintain a copy of an AWM (e.g., as shown in FIG. 1B). For example, the AWM can be transmitted to each node in cluster 120 that is communicatively coupled to the node(s) that receives the AWM from configuration system 105.

As shown in FIG. 3, AWM 160 includes a list of all applications (and instances of all such applications) executing in cluster 120. AWM 160 also identifies one or more nodes on which such instances of applications are executing, whether the nodes are failover nodes or parallel nodes, a relative criticality of each application (e.g., in the form of a numerical value assigned to each application called an application weight, or simply weight), as well as any changes to the foregoing information.

AWM 160, as shown in FIG. 3, identifies nodes 125(1)-(4) and a weight of each application that is executing on each of nodes 125(1)-(4). For example, application 140 executes on nodes 125(1) and 125(2) (and has a weight of 2), application 145 executes on nodes 125(2) and 125(3) (and has a weight of 4), application 150 executes on nodes 125(3) and 125(4) (and has a weight of 6), and application 155 executes on nodes 125(1), 125(2), and 125(4) (and has a weight of 8). AWM 160 includes a total application weight field 320 which includes a total application weight of an application executing on one or more nodes in cluster 120. AWM 160 also includes a total node weight field 330 which includes a total node weight of one or more applications executing on a particular node. In some embodiments, information contained in AWM 160 can be used to perform application fencing operations.
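The totals held in fields 320 and 330 can be illustrated with a short sketch. The dictionary layout and function names below are hypothetical and are not part of the disclosure; the sketch assumes that the stated weight applies to each instance of an application, and it places application 155 on nodes 125(1), 125(2), and 125(4), consistent with registration keys 125(1)-155, 125(2)-155, and 125(4)-155 described in connection with FIG. 4.

```python
# Hypothetical in-memory form of AWM 160: application -> {node: weight}.
# Assumption: the stated weight applies to every instance of the application.
AWM = {
    "140": {"125(1)": 2, "125(2)": 2},
    "145": {"125(2)": 4, "125(3)": 4},
    "150": {"125(3)": 6, "125(4)": 6},
    "155": {"125(1)": 8, "125(2)": 8, "125(4)": 8},
}


def total_application_weights(awm):
    """Sum each application's weight over all nodes (cf. field 320)."""
    return {app: sum(nodes.values()) for app, nodes in awm.items()}


def total_node_weights(awm):
    """Sum the weights of all applications on each node (cf. field 330)."""
    totals = {}
    for nodes in awm.values():
        for node, weight in nodes.items():
            totals[node] = totals.get(node, 0) + weight
    return totals


app_totals = total_application_weights(AWM)
node_totals = total_node_weights(AWM)
cluster_total = sum(app_totals.values())
```

Under these assumptions the cluster-wide total is 48 and the total for application 155 is 24, which agrees with the figures used in the partition arbitration example.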

An Example of Registering Application-Aware Registration Keys on Coordination Points

FIG. 4 is a block diagram of a coordination point registration key matrix (CPRKM), according to one embodiment. It will be appreciated that CPRKM 410 visually illustrates the registration, placement, or installation of specific application-aware registration keys on coordination points. CPRKM 410 may or may not be used in a table format. If CPRKM 410 is used as a matrix or table by fencing modules 175(1)-(N), then CPRKM 410 can be maintained by or stored on coordination points 180(1)-(N). However, CPRKM 410 is not required in addition to the registration keys themselves to perform application I/O fencing operations.

Under a traditional fencing paradigm, only a node is identified and registered in the form of a key (e.g., a key is registered on a coordination point per node, and partition arbitration is performed on the basis of this key). However, in one embodiment, application-aware registration keys are installed, placed, or registered on coordination points by a fencing module. For example, fencing module 175(1) registers, places, or installs a registration key pertaining to each application running or executing on a particular node (e.g., an application-aware registration key). This “application-aware” registration key can be used in some embodiments to perform partition arbitration in the context of specific applications.

Each application-aware registration key contains at least two pieces of information: a node identifier and an application identifier (collectively referred to herein as “identifier”). For example, CPRKM 410 of FIG. 4 contains 9 registration keys 185(1)-(9). Each registration key has a node identifier (e.g., 125(1)) and an application identifier (e.g., 140). Therefore, registration keys 185(1)-(9) (or identifiers) installed on coordination points 180(1)-(N) are 125(1)-140, 125(2)-140, 125(2)-145, 125(3)-145, 125(3)-150, 125(4)-150, 125(1)-155, 125(2)-155, and 125(4)-155, respectively.
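As an illustration only, the nine identifiers listed above can be derived from a mapping of applications to the nodes on which their instances execute. The mapping, the function name, and the "node-application" string encoding are assumptions made for this sketch; the disclosure does not prescribe how a key is encoded.

```python
# Hypothetical mapping: application identifier -> nodes running an instance.
APP_NODES = {
    "140": ["125(1)", "125(2)"],
    "145": ["125(2)", "125(3)"],
    "150": ["125(3)", "125(4)"],
    "155": ["125(1)", "125(2)", "125(4)"],
}


def make_registration_keys(app_nodes):
    """Build one "node-application" key per application instance."""
    return [
        f"{node}-{app}"
        for app, nodes in app_nodes.items()
        for node in nodes
    ]


keys = make_registration_keys(APP_NODES)
for key in keys:
    print(key)
```

Because each key pairs a node with one application, a single node can hold several keys at once (node 125(2) holds three in this example).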

In one embodiment, fencing application 165 (which includes a fencing driver and a fencing module) determines that an instance of an application is executing on a node, and generates an identifier for the instance of the application that associates the instance of the application and the node on which the instance of the application is executing (e.g., 125(1)-140). Fencing application 165 then installs the identifier on coordination point(s). In another embodiment, fencing application 165 determines whether instances of other applications are executing on the node. If instances of other applications are executing on the node, fencing application 165 generates other identifiers for instances of other applications that associate each of the instances and the node (e.g., 125(1)-155). Fencing application 165 then installs (or registers) the other identifier(s) on the coordination point(s). In certain embodiments, the identifier and the other identifier(s) are application-aware registration keys.

In some embodiments, fencing application 165 can determine whether cluster 120 is partitioned into two (or more) network partitions (e.g., partitions 210 and 220). Fencing application 165, using a fencing module, can determine whether a split-brain condition exists in cluster 120 as a result of the cluster partitioning and whether the split-brain condition is caused by one or more application instances executing on one or more nodes in the cluster (e.g., an “application-induced” split-brain condition as shown in FIG. 2B).

In other embodiments, fencing application 165 performs an application fencing operation to rectify the application split-brain condition by accessing AWM 160 and performing a partition arbitration process. In this example, and as part of performing the application fencing operation, fencing application 165 uninstalls, removes, or ejects application-aware registration key(s) of application instance(s) from coordination point(s) based on a result of the partition arbitration process (which is performed in the context of specific applications). The uninstalling, removing, or ejection causes the termination of application instance(s) instead of node(s) on which the application instance(s) are executing.

FIG. 5 is a block diagram of a computing system that registers application-aware registration keys on coordination points, according to one embodiment. As previously noted, coordination points can be implemented in a cluster to assist with fencing operations. Coordination points provide a lock mechanism to determine which nodes are allowed to fence off shared storage (e.g., data disks 195(1)-(N)) from other nodes in the cluster. For example, a node (e.g., a racer node) must eject the registration key of a peer node from a coordination point (e.g., from a coordinator disk buffer 510(1) of coordination point 180(1) or from a coordinator point server daemon 520(1) of coordination point 180(3)) before that node is allowed to fence the peer node from shared storage. Coordination points can be either disks or servers, or both. Typically, and in one embodiment, cluster 120 includes at least three (3) coordination points, which can be a combination of disks and/or servers.

Disks that function as coordination points are called coordinator disks. In one embodiment, coordinator disks are three (3) standard disks or LUNs (Logical Unit Numbers) set aside for application fencing during cluster reconfiguration (e.g., before a cluster is formed). Coordinator disks (and coordination points) do not serve any other storage purpose in a cluster (e.g., such as data storage or inclusion in a disk group for user data). Any disks that support SCSI-3 Persistent Reservation (SCSI-3 PR) can be coordinator disks. In another embodiment, a coordination point can also be a server called a coordination point server. A coordination point server is a software solution that runs on a remote computing system or cluster. Therefore, regardless of whether a coordination point is a coordinator disk or a coordination point server, a coordination point permits node(s) in a cluster to at least: (1) register and become a member of a cluster, (2) determine which other nodes have successfully registered as members of the same cluster, (3) un-register from the cluster, and (4) forcefully un-register and preempt other nodes as members of the cluster.
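The four capabilities enumerated above can be pictured with a toy, in-memory stand-in for a coordination point. The class and method names are hypothetical, and the rule that only a registered party may preempt another is an assumption of this sketch (loosely modeled on reservation-style locking), not a statement of how a coordinator disk or coordination point server is actually implemented.

```python
class CoordinationPoint:
    """Toy in-memory stand-in for a coordinator disk or CP server."""

    def __init__(self):
        self._keys = set()

    def register(self, key):
        """(1) Register a key and thereby join the cluster."""
        self._keys.add(key)

    def registered_keys(self):
        """(2) List every key that is currently registered."""
        return sorted(self._keys)

    def unregister(self, key):
        """(3) Voluntarily remove one's own key."""
        self._keys.discard(key)

    def preempt(self, requester_key, victim_key):
        """(4) Forcefully remove another party's key.

        Assumption: the requester must itself be registered.
        """
        if requester_key not in self._keys:
            return False
        self._keys.discard(victim_key)
        return True


cp = CoordinationPoint()
cp.register("125(1)-140")
cp.register("125(3)-145")
preempted = cp.preempt("125(1)-140", "125(3)-145")
remaining = cp.registered_keys()
```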

In some embodiments, coordination points are, at a minimum, any combination of three (3) coordinator disks or coordination point servers that act together as a global lock device because racing for control of these coordination points (e.g., in a fencing race) is used to determine cluster membership. Because control of a cluster is granted to a node that gains control of (or wins) a majority of coordination points, it is preferable to have an odd number of coordination points (e.g., any odd number combination of coordinator disks and/or coordination point servers), though such is not strictly necessary. In one embodiment, a maximum of three (3) coordinator disks or coordination point servers (or any combination of the two) are implemented.

As previously noted, fencing can be used to ensure that only one partition (or sub-cluster) survives in a cluster which has experienced network partition such that only the surviving partition is able to write to shared storage. Application fencing, as described herein, uses a fencing race to determine which partition or sub-cluster gets to fence off application instances executing on the nodes in the other sub-cluster or partition. Because coordination points are used to manage access to shared storage, in one embodiment, the fencing race refers to nodes in different sub-clusters or partitions racing to gain access to (or reach) the majority of coordination points. Therefore, the fencing race refers to a partition or sub-cluster of nodes that has connectivity (or accessibility) to a majority of coordination points.

It should be noted that nodes in a sub-cluster (or partition) require access to a majority of coordination points because having just one coordination point available to a cluster can give rise to a single point of failure. For example, if a single coordination point fails for any reason, the cluster can lose operational capabilities. Further, using two (2) (or an even number of) coordination points (e.g., four (4), six (6), etc.) can result in a situation where no sub-cluster can definitively win a fencing race because node(s) in different sub-clusters can access (and win) the same number of, albeit different, coordination points (e.g., in a situation where a cluster is partitioned into two sub-clusters with two (2) or four (4) available coordination points).

Therefore, using a single coordination point or an even number of coordination points can result in nodes in both sub-clusters writing data to shared storage, thus causing data corruption. Therefore, to keep a desired partition operational in a cluster that has been partitioned, a node in a sub-cluster, either alone or in combination with other nodes in that sub-cluster, must be able to access (and win) a majority of the coordination points available to the cluster (e.g., a task that can only be accomplished definitively in all situations if an odd number of coordination points are made available).
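The majority requirement discussed above reduces to simple arithmetic, sketched below. The function names and the dictionary of per-partition win counts are hypothetical and serve only to show why an even number of coordination points can leave a two-way race undecided.

```python
def majority(total_points):
    """Smallest number of coordination points that is a strict majority."""
    return total_points // 2 + 1


def fencing_race_winner(points_won, total_points):
    """Return the partition holding a majority, or None if none does.

    points_won maps a partition label to the number of coordination
    points that partition's racer node has won.
    """
    needed = majority(total_points)
    for partition, won in points_won.items():
        if won >= needed:
            return partition
    return None


# Three coordination points: a 2-1 split always yields a winner.
odd_case = fencing_race_winner({"210": 2, "220": 1}, 3)
# Four coordination points: a 2-2 split yields no winner.
even_case = fencing_race_winner({"210": 2, "220": 2}, 4)
```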

It will be appreciated that coordination points 180(1)-(N) generally represent any type or form of computing device that is capable of performing or being used to perform application fencing decisions (e.g., coordination point 180(1) may be used to resolve application split-brain scenarios for cluster 120 subsequent to a partitioning event). Coordination points 180(1)-(N) may represent one or more coordination disks and/or one or more coordination servers that can be used to make application fencing decisions. Examples of coordination points 180(1)-(N) include, without limitation, application servers and database servers configured to provide various database services and/or run certain software applications, storage devices (such as disks or disk arrays), laptops, desktops, cellular phones, personal digital assistants (PDAs), multimedia players, embedded systems, and/or combinations of one or more of the same.

FIG. 6 is a block diagram of nodes that install application-aware registration keys on coordination points as part of joining a cluster, according to one embodiment. As shown in FIG. 6, node 125(1) executes applications 140 and 155, node 125(2) executes applications 140, 145, and 155, node 125(3) executes applications 145 and 150, and node 125(4) executes applications 150 and 155. For example, when node 125(1) joins cluster 120, node 125(1) installs two application-aware registration keys (e.g., 125(1)-140 and 125(1)-155) on coordination points 180(1), 180(2), and 180(3). When node 125(2) joins cluster 120, node 125(2) installs three application-aware registration keys (e.g., 125(2)-140, 125(2)-145, and 125(2)-155) on coordination points 180(1), 180(2), and 180(3). When node 125(3) joins cluster 120, node 125(3) installs two application-aware registration keys (e.g., 125(3)-145 and 125(3)-150) on coordination points 180(1), 180(2), and 180(3). Finally, when node 125(4) joins cluster 120, node 125(4) installs two application-aware registration keys (e.g., 125(4)-150 and 125(4)-155) on coordination points 180(1), 180(2), and 180(3). These installed application-aware registration keys are stored on each of coordination points 180(1), 180(2), and 180(3) as registration keys 185(1)-(9). In some embodiments, CPRKM 410 can also be stored on coordination points 180(1), 180(2), and 180(3) along with registration keys 185(1)-(9).

An Example of Performing Partition Arbitration Based on Application Weight

FIG. 7 is a block diagram of racer nodes that perform application-aware partition arbitration, according to one embodiment. Prior to performing a fencing race, a sub-cluster elects a racer node. A racer node is a node that is designated by a sub-cluster to determine whether it can access one or more coordination points available to the cluster (as a whole). Typically, a racer node is chosen by the cluster (or designated) based on a node identifier. However, it should be noted that other methods of choosing and/or designating a racer node other than by node identifier are also contemplated.

As shown in FIG. 7, upon a cluster partitioning event, partition 210 selects node 125(1) as the racer node for partition 210, and partition 220 selects node 125(3) as the racer node for partition 220. In some embodiments, fencing module 175(1) of node 125(1) performs an application fencing operation by installing an identifier on one or more coordination points. In this example, the identifier associates an instance of an application with a node on which the instance of the application is executing. Fencing module 175(1) then determines a weight assigned to the instance of the application, and terminates the instance of the application, based, at least in part, on the weight.

For example, AWM 160, as shown in FIG. 7, indicates that all instances of application 140 execute on partition 210 (e.g., on nodes 125(1) and 125(2)) because the weight of application 140 for nodes 125(3) and 125(4) according to AWM 160 is zero. Therefore, if all instances of an application are running on the same partition, fencing module 175(1) permits application 140 to continue running on nodes 125(1) and 125(2) without performing an application fencing operation, as the application fencing operation is not necessary (e.g., there is no split-brain condition in cluster 120). Similarly, AWM 160 indicates that all instances of application 150 execute on partition 220 (e.g., on nodes 125(3) and 125(4)) because the weight of application 150 for nodes 125(1) and 125(2) is zero. Therefore, fencing module 175(3) permits application 150 to continue running on nodes 125(3) and 125(4) without performing a fencing operation, as the application fencing operation is not necessary (e.g., there is no split-brain condition in cluster 120).

However, if various instances of an application are running on two separate partitions created as a result of cluster partitioning, a fencing module performs an application fencing operation that results in the termination of instance(s) of an application that are executing or running on a loser partition (e.g., based on a weight assigned to that application in AWM 160). In this manner, and as part of performing an application fencing operation, fencing application 165 causes termination of the instance of the application instead of the node on which the instance of the application is executing.

For example, fencing application 165 can determine whether cluster 120 is partitioned into network partitions (e.g., partitions 210 and 220). Fencing application 165 can access AWM 160 and perform a partition arbitration process using AWM 160. If an application is executing on two separate partitions (e.g., application 145, which executes on node 125(2) that is part of partition 210 and on node 125(3) that is part of partition 220), the partition arbitration process can include performing a fencing race to determine a winner partition and a loser partition.

As previously noted, fencing application 165 does not perform (and does not need to perform) a fencing race if all instances of an application execute on nodes that are part of the same network partition (e.g., applications 140 and 150). Conversely, fencing application 165 performs a fencing race if an instance of an application and other instances of the same application execute on separate network partitions (e.g., applications 145 and 155).
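The distinction drawn above, between applications confined to one partition and applications whose instances straddle partitions, can be sketched as a simple membership check. The data structures and the function name are hypothetical; node placement follows the FIG. 7 example (nodes 125(1) and 125(2) in partition 210, nodes 125(3) and 125(4) in partition 220).

```python
# Hypothetical inputs for the FIG. 7 example.
APP_NODES = {
    "140": {"125(1)", "125(2)"},
    "145": {"125(2)", "125(3)"},
    "150": {"125(3)", "125(4)"},
    "155": {"125(1)", "125(2)", "125(4)"},
}
PARTITIONS = {
    "210": {"125(1)", "125(2)"},
    "220": {"125(3)", "125(4)"},
}


def split_applications(app_nodes, partitions):
    """Return applications with instances in more than one partition."""
    split = []
    for app, nodes in app_nodes.items():
        hosting = [p for p, members in partitions.items() if nodes & members]
        if len(hosting) > 1:
            split.append(app)
    return split


needs_race = split_applications(APP_NODES, PARTITIONS)
```

Only the applications returned by this check would need a fencing race; the others can keep running without arbitration.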

In some embodiments, the fencing race to determine the winner partition and loser partition(s) is decided based on information in AWM 160. In this example, and as part of performing the fencing race, the method uninstalls or ejects the (application-aware) registration key for the instance of the application from a coordination point based on the weight assigned to the instance of the application. For example, and as shown in FIG. 7, there are four (4) nodes in cluster 120: nodes 125(1)-(4). Upon a cluster partitioning event, two partitions are created: partition 210 and partition 220. Each partition elects one racer node. For instance, partition 210 elects node 125(1) (shown in bold in FIG. 7) as the racer node for partition 210, and partition 220 elects node 125(3) (shown in bold in FIG. 7) as the racer node for partition 220.

Nodes 125(1) and 125(3) begin a fencing race independently. If node 125(1) can access (or reach) one or more coordination points before node 125(3), fencing application 165 starts the fencing race with the racer node (e.g., node 125(1)) to “win” (or claim) the coordination point by ejecting, removing, or uninstalling the application-aware registration keys of node 125(3) from that coordination point (e.g., 125(3)-145), thus preempting node 125(3) from winning that coordination point. In this example, node 125(1), which is the racer node for partition 210, accesses AWM 160(1) and identifies the number of applications that are running in partition 210 versus partition 220 to determine whether there is an application-induced split-brain condition in cluster 120. Because application 145 causes an application split-brain condition, node 125(1) instructs the coordination point to remove (or eject) the registration key for node 125(3) from the coordination point. In this manner, node 125(1) wins the race for the coordination point.

In some embodiments, fencing application 165 can fine tune the behavior of cluster 120, for example, by determining a total application weight in cluster 120 (e.g., 48) and then determining a partition weight of the applications that each partition is executing. If each partition has a different partition weight, then fencing application 165 can introduce a delay to the fencing race to ensure that a more critical partition can win the fencing race. If both partitions have the same partition weight (e.g., for application 145), both partitions can enjoy the same preference. For example, node 125(3) can win the fencing race in the above example based on factors such as network delay, bandwidth, device performance, and the like.

In the case of application 155, an application-induced split-brain condition exists because the total application weight of 24 is split between partition 210 (16) and partition 220 (8). Therefore, because an application split-brain condition exists for application 155, the racer node (e.g., node 125(1)) removes, ejects, or uninstalls the application-aware registration key 125(4)-155 from the coordination point (e.g., because partition 210 has a higher weight in totality compared to partition 220).

Because application-aware registration keys 125(3)-145 and 125(4)-155 are ejected, deleted, removed, or uninstalled from the coordination point, when the racer node for partition 220 (e.g., node 125(3)) reaches the coordination point, the racer node will not find the application-aware registration keys 125(3)-145 and 125(4)-155. As a result, the racer node for partition 220 will terminate competing applications (e.g., applications 145 and 155) from nodes 125(3) and 125(4) respectively, without terminating nodes 125(3) and 125(4) themselves. In this manner, the application-induced split-brain condition is rectified and nodes 125(3) and 125(4) can continue to execute application 150.
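The FIG. 7 outcome can be reproduced with a small sketch of per-application arbitration. Everything below is illustrative: the data layout and function names are hypothetical, per-instance weights are assumed, and a tie in partition weight is resolved in favor of the partition whose racer node reached the coordination point first, which is one reading of the equal-preference case described above rather than a rule stated by the disclosure.

```python
# Hypothetical AWM (application -> {node: weight}) and partition layout.
AWM = {
    "140": {"125(1)": 2, "125(2)": 2},
    "145": {"125(2)": 4, "125(3)": 4},
    "150": {"125(3)": 6, "125(4)": 6},
    "155": {"125(1)": 8, "125(2)": 8, "125(4)": 8},
}
PARTITIONS = {
    "210": {"125(1)", "125(2)"},
    "220": {"125(3)", "125(4)"},
}


def partition_weights(app, awm, partitions):
    """Weight of one application inside each partition."""
    return {
        p: sum(w for node, w in awm[app].items() if node in members)
        for p, members in partitions.items()
    }


def keys_to_eject(awm, partitions, first_racer_partition):
    """Keys of application instances that lose per-application arbitration."""
    ejected = []
    for app in awm:
        weights = partition_weights(app, awm, partitions)
        active = {p: w for p, w in weights.items() if w > 0}
        if len(active) < 2:
            continue  # all instances in one partition: no race needed
        top = max(active.values())
        leaders = [p for p, w in active.items() if w == top]
        # Assumed tie-break: the partition whose racer arrived first wins.
        winner = (first_racer_partition
                  if first_racer_partition in leaders else leaders[0])
        for node in awm[app]:
            if node not in partitions[winner]:
                ejected.append(f"{node}-{app}")
    return ejected


ejected = keys_to_eject(AWM, PARTITIONS, first_racer_partition="210")
```

With node 125(1) arriving first, the sketch selects keys 125(3)-145 and 125(4)-155 for ejection, leaving application 150 untouched on nodes 125(3) and 125(4).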

Processes for Performing Application-Aware Fencing Operations

FIG. 8A is a flowchart of a process for receiving an application weight matrix, according to one embodiment. The process begins at 805 by receiving an application weight matrix (e.g., AWM 160 from a computing device that is not part of cluster 120). At 810, the process transmits the AWM to other node(s) in the cluster. For example, node 125(1) can receive AWM 160 from configuration system 105 as part of configuration file 110 and transmit AWM 160 to nodes 125(2), 125(3), and the like.

At 815, the process determines whether there is an update to the AWM. If there is an update to the AWM, the process loops to 805 and receives the (updated) AWM and re-transmits the (updated) AWM to the other nodes in the cluster. If there is no update to the AWM, at 820, the process determines whether to wait for an update. If waiting is required, the process loops to 815 and determines if there is an update (and if there is indeed an update, loops to 805, as noted above). However, if no waiting is required, the process ends.

FIG. 8B is a flowchart of a process for generating application-aware registration keys, according to one embodiment. The process begins at 825 by accessing an AWM. For example, node 125(1) can access AWM 160(1). At 830, the process generates application-aware registration key(s) based on the information in the AWM (e.g., based on which applications are executing on which nodes in cluster 120). At 835, the process installs, registers, or places the generated application-aware registration key(s) on one or more coordination points.

At 840, the process determines whether there is a new node in the cluster. If a new node has joined the cluster, the process loops to 825 and generates registration key(s) for the new node based on the applications that are executing on the new node and installs the (new) registration key(s) on the one or more coordination points. However, if a new node has not joined the cluster, the process, at 845, determines whether there is a need to wait for a new node. If there is such a need, the process loops back to 840. If there is no such need, the process ends.

FIG. 9A is a flowchart of a process for installing application-aware registration keys on coordination points, according to one embodiment. The process begins at 905 by detecting a node joining a cluster or waiting to find a new node. At 910, the process determines whether there are (one or more) instances of application(s) executing on the node. If there are no applications executing on the node, the process loops back to 905 and detects whether another node joins the cluster. If there are instance(s) of application(s) executing on the node, the process, at 915, associates the instance(s) of the application(s) and the node.

At 920, the process generates (application-aware) registration key(s) for the instance(s) of the application(s) based on the association. At 925, the process installs the registration key(s) on one or more coordination points (e.g., on an odd number of coordination points greater than three). At 930, the process determines if there is another application (or application instance) that has begun to execute on the node. If so, the process loops back to 915, and generates and installs a new registration key on the coordination points. If not, the process, at 935, determines whether there is a need to continue to detect node(s) that may join the cluster. If there is such a need, the process loops back to 905. If there is no such need, the process ends.

FIG. 9B is a flowchart of a process for generating a coordination point registration key matrix, according to one embodiment. The process begins at 940 by accessing a coordination point. At 945, the process determines whether there are any registration key(s) installed on the coordination point. If no registration key(s) are installed, the process, at 950, waits for node(s) to join the cluster. However, if registration key(s) are installed on the coordination point, the process, at 955, generates a coordination point registration key matrix (e.g., CPRKM 410).

At 960, the process stores the CPRKM on the coordination point (e.g., along with the installed registration key(s)). As previously noted, the generation and storing of the CPRKM is optional. At 965, the process determines whether new registration key(s) are installed on the coordination point. If new registration keys are installed, the process, at 970, updates the CPRKM. However, if no new registration keys are installed, the process ends.

FIG. 10 is a flowchart of a process for performing an application fencing operation, according to one embodiment. The process begins at 1005 by installing application-aware registration key(s) on coordination point(s). At 1010, the process detects partitioning of the cluster (e.g., cluster 120). At 1015, the process determines whether cluster partitioning has occurred. If cluster partitioning has not occurred, the process reverts to detecting partitioning of the cluster. However, if the cluster has indeed been partitioned, the process, at 1020, identifies an application instance that is causing an application split-brain condition in the cluster (or an application-induced split-brain condition).

At 1025, the process accesses, in the AWM, the weight assigned to the application instance that is causing the application split-brain condition in the cluster. At 1030, the process initiates partition arbitration for the application instance that is causing the split-brain condition. At 1035, the process determines whether the application instance is part of a winner partition or a loser partition (e.g., as a result of performing a fencing race as part of the partition arbitration process). If the application instance is part of a winner partition, the process, at 1040, broadcasts the result to the other node(s) in the cluster.

However, if the application instance is part of a loser partition, the process, at 1045, deletes, removes, uninstalls, or ejects the (application-aware) registration key of the application instance from the coordination point, and at 1050, terminates the application instance on the node as part of the fencing operation. At 1055, the process determines whether there is another application. If there is another application (or application instance), the process loops back to 1005. If there are no more applications, the process ends.

FIG. 11 is a flowchart of a process for uninstalling application-aware registration keys from coordination points, according to one embodiment. The process begins at 1105 by determining location(s) of application instances (e.g., the various nodes on which the application instance(s) are executing as specified in an AWM). At 1110, the process determines whether there is a split-brain condition in the cluster because of network partitioning. If there is no split-brain condition in the cluster due to the network partitioning, the process loops back to 1105. However, if there is a split-brain condition in the cluster due to the network partitioning, the process, at 1115, determines whether application instances are executing on separate partitions.

If the application instances are not executing on separate partitions, the process, at 1120, allows the application instances to continue running (e.g., because there is no application-induced split-brain and there is no need to perform a fencing race). However, if the application instances are executing on separate partitions, the process, at 1125, initiates a fencing operation to rectify the application split-brain condition. As part of the fencing operation, the process, at 1130, initiates a partition arbitration process that includes a fencing race to determine winner and loser partitions (or node groups) based on application weight specified in the AWM.

At 1135, the process uninstalls, deletes, removes, or ejects application-aware registration keys of application instance(s) in loser partition(s) from coordination point(s), and at 1140 receives confirmation from node(s) in loser partition(s) that the application instance(s) have been terminated as part of the application fencing operation. At 1145, the process determines if there is a need to continue to detect an application-induced split-brain condition. If there is such a need, the process loops to 1105. If there is no such need, the process ends.
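The check at 1115 (whether instances of an application are executing on separate partitions) can be sketched as follows. This is an illustrative sketch under stated assumptions: the mapping of application names to node sets, the representation of partitions as sets of node names, and the function name are all hypothetical.

```python
from typing import Dict, List, Set


def apps_needing_fencing(
    app_locations: Dict[str, Set[str]],
    partitions: List[Set[str]],
) -> List[str]:
    """Return applications whose instances run in more than one partition.

    app_locations maps an application to the nodes running its instances;
    partitions lists the node groups that resulted from the network split.
    """
    split_apps = []
    for app, nodes in app_locations.items():
        # Partitions that host at least one instance of this application.
        hosting = [p for p in partitions if nodes & p]
        if len(hosting) > 1:
            split_apps.append(app)
    return sorted(split_apps)
```

Applications absent from the returned list are allowed to continue running (1120); only the listed applications proceed to the fencing operation at 1125.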

FIG. 12 is a flowchart of a process for performing an application fencing operation, according to one embodiment. The process begins at 1205 by determining if there is a network partitioning event. If there is a network partitioning event, the process, at 1210, accesses an AWM. At 1215, the process determines whether a weight of an application in a given partition is zero, or whether the weight of the application in the given partition is equal to a total weight of the application across the whole cluster (e.g., before the cluster was partitioned). If the weight of an application in the given partition is not zero, or if the weight of the application in the given partition is not equal to the total weight of the application across the whole cluster, the process proceeds to FIG. 13.

If the weight of the application in the given partition is zero, the application is not running or executing in the given partition. If the weight of the application in the given partition is equal to the total weight of the application across the whole cluster, the application is running or executing entirely in the given partition. In both cases, there is no application split-brain condition, and fencing application 165 can determine that the application has (preemptively) won the fencing race (e.g., without needing to perform such a race and without needing to access coordination points as part of performing such a race). Fencing application 165 flags the application for broadcast and notifies the other node(s) in the cluster that there is no application-induced split-brain condition that needs rectification. Therefore, at 1220, the process stores the application and broadcasts a “won race” for the application, and at 1225, permits the application to resume (operations) without (performing) a fencing race. At 1230, the process determines if there is another application in the (given) partition. If there is another application, the process loops back to 1210. If not, the process ends at 1235.
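The decision at 1215 can be sketched as follows. This is a minimal sketch, assuming the AWM is available as a nested dictionary (application, then node, then weight); that layout and the function names are assumptions for illustration rather than the format used by the disclosed embodiments.

```python
from typing import Dict, List, Set, Tuple

# Assumed AWM layout: application -> node -> weight of the instance on that node.
AWM = Dict[str, Dict[str, int]]


def partition_weight(awm: AWM, app: str, partition: Set[str]) -> int:
    # Sum the weights of the application's instances on nodes in the partition.
    return sum(w for node, w in awm[app].items() if node in partition)


def total_weight(awm: AWM, app: str) -> int:
    return sum(awm[app].values())


def triage(awm: AWM, partition: Set[str]) -> Tuple[List[str], List[str]]:
    """Split applications into (no race needed, race needed)."""
    no_race, race = [], []
    for app in awm:
        weight = partition_weight(awm, app, partition)
        if weight == 0 or weight == total_weight(awm, app):
            # Not running here at all, or running entirely here:
            # no application split-brain, so no fencing race is needed.
            no_race.append(app)
        else:
            race.append(app)
    return no_race, race
```

Applications in the first list correspond to the "won race" broadcast at 1220; applications in the second list continue to the process of FIG. 13.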

FIG. 13 is a flowchart of a process for performing an application fencing operation, according to one embodiment. The process begins at 1305 by determining that the weight of an application in the given partition is not zero, or the weight of the application in the given partition is not equal to the total weight of the application across the whole cluster. At 1310, the process determines whether the weight of the application in the given partition is greater than or equal to the weight of the application in another partition (e.g., a leaving partition).

If the weight of the application in the given partition is greater than or equal to the weight of the application in another partition, the process, at 1315, creates a bucket B1 and stores the application for a cumulative fencing race for all such applications (e.g., all applications where the weight of such applications in the given partition is greater than or equal to the weight of such applications in another partition). In one embodiment, a cumulative fencing race can improve the speed and performance of fencing operations by permitting a node in a cluster to submit a single request to a coordination point to remove, delete, uninstall, boot, or eject multiple application-aware registration key(s). In another embodiment, buckets B1 and B2 are arrays, and can be maintained by a racer node.

However, if the weight of the application in the given partition is not greater than or equal to the weight of the application in another partition, the process, at 1320, creates a bucket B2 and stores the application for a cumulative fencing race for all such applications (e.g., all applications where the weight of such applications in the given partition is not greater than or equal to the weight of such applications in another partition). At 1325, the process introduces a delay. For example, if the application has a greater weight in the given partition (e.g., partition 210) compared to another partition (e.g., partition 220), then the given partition can commence the fencing race immediately, and the other partition can introduce the delay. Therefore, at 1330, the process starts the fencing race for bucket B1 or B2. The process ends at 1335 by ejecting or uninstalling application-aware registration key(s) of the leaving partition from coordination point(s).
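The bucket assignment at 1310 through 1325 can be sketched as follows. This is an illustrative sketch only: the input format (application mapped to a pair of weights), the function names, and the configurable delay value are assumptions, since the disclosure does not fix them.

```python
from typing import Dict, List, Tuple


def assign_buckets(
    weights: Dict[str, Tuple[int, int]],
) -> Tuple[List[str], List[str]]:
    """Assign applications to buckets B1 and B2.

    weights maps an application to
    (weight in the given partition, weight in the other partition).
    """
    b1: List[str] = []  # weight here >= weight elsewhere: race immediately
    b2: List[str] = []  # weight here < weight elsewhere: race after a delay
    for app, (mine, other) in weights.items():
        if mine >= other:
            b1.append(app)
        else:
            b2.append(app)
    return b1, b2


def race_delay(bucket: str, delay_seconds: float) -> float:
    # B1 starts the cumulative fencing race at once; B2 waits first.
    return 0.0 if bucket == "B1" else delay_seconds
```

Because each bucket is raced cumulatively, a racer node in this sketch would submit one request per bucket rather than one request per application.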

In addition to buckets B1 and B2, several other buckets or arrays can be generated depending on the number of applications and the respective weights of these applications in different partitions. In one embodiment, a Bucket A can include applications whose instances are all running in the racer node's partition. In this case, fencing application 165 can preemptively declare a “won” race and notify other node(s) in the cluster because the coordination points do not have the application-aware registration key(s) for these nodes. In another embodiment, a Bucket B can include applications for which the application weight in the racer node's partition is more than the rest of the cluster. In this case, the racer node starts the fencing race without delay and removes the application-aware registration key(s) of the other partition(s) from coordination point(s). In some embodiments, a Bucket C can include applications for which the application weight in the racer node's partition is “w1” units less than the rest of the cluster. In this case, the racer node begins the fencing race after a delay of “x” seconds. In other embodiments, a Bucket D can include applications for which the application weight in the racer's partition is “w1 to w2” units less than the rest of the cluster. In this case, the racer node begins the fencing race after a delay of “y” seconds. Therefore, in this manner, multiple buckets or arrays can be created or generated based on the number of applications executing on various nodes in the cluster and the respective weights of these applications in the AWM.
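One possible reading of Buckets A through D is sketched below. The thresholds are interpreted here as ranges of weight deficit (a deficit of up to w1 units for Bucket C, and more than w1 up to w2 units for Bucket D); that interpretation, the treatment of equal weights as Bucket C, the None result for deficits beyond w2, and all names are assumptions made for this example.

```python
from typing import Optional


def classify(mine: int, rest: int, w1: int, w2: int) -> Optional[str]:
    """Pick a bucket from the application's weight in the racer node's
    partition (mine) and its weight in the rest of the cluster (rest)."""
    if rest == 0:
        return "A"  # all instances are in the racer node's partition
    if mine > rest:
        return "B"  # heavier here: race without delay
    deficit = rest - mine
    if deficit <= w1:
        return "C"  # assumed range: up to w1 units lighter
    if deficit <= w2:
        return "D"  # assumed range: more than w1, up to w2 units lighter
    return None  # beyond w2: not covered by the buckets described above


def start_delay(bucket: str, x: float, y: float) -> float:
    # Buckets A and B need no delay; C waits x seconds and D waits y seconds.
    return {"A": 0.0, "B": 0.0, "C": x, "D": y}[bucket]
```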

FIG. 14 is a flowchart of a process for performing an application fencing operation, according to one embodiment. The process begins at 1405 by determining whether a number of coordination points won (e.g., by a racer node) is greater than or equal to a total number of coordination points divided by two plus one. If the number of coordination points won is greater than or equal to the total number of coordination points divided by two plus one, the process, at 1410, broadcasts a “won” fencing race for application(s) (e.g., in the racer node's partition), and ends at 1415 by unblocking clients (e.g., thus permitting access to shared storage). However, if the number of coordination points won is not greater than or equal to the total number of coordination points divided by two plus one, the process, at 1420, broadcasts a “lost” fencing race for application(s) (e.g., not in the racer node's partition), and ends at 1425 by receiving confirmation (e.g., from node(s) in the loser partition(s)) that application(s) (or application instance(s)) in the loser partition have been terminated.
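The test at 1405 can be sketched as follows. The sketch assumes that "divided by two plus one" means integer division followed by adding one (a strict majority, e.g., two of three coordination points); that reading and the function name are assumptions for illustration.

```python
def won_fencing_race(points_won: int, total_points: int) -> bool:
    """Return True if the racer node won a majority of coordination points.

    Assumes the threshold is floor(total_points / 2) + 1.
    """
    return points_won >= total_points // 2 + 1
```

With an odd number of coordination points, at most one partition can satisfy this condition, so the "won" and "lost" broadcasts at 1410 and 1420 cannot conflict under this reading.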

It will be appreciated that the methods, systems, and processes disclosed herein perform application fencing operations by causing the termination of an instance of an application instead of terminating the node on which that instance of the application is executing, thus improving cluster availability and performance.

An Example Computing Environment

FIG. 15 is a block diagram of a computing system, illustrating how afencing module can be implemented in software, according to oneembodiment. Computing system 1500 broadly represents any single ormulti-processor computing device or system capable of executingcomputer-readable instructions. Examples of computing system 1500include, without limitation, any one or more of a variety of devicesincluding workstations, personal computers, laptops, client-sideterminals, servers, distributed computing systems, handheld devices(e.g., personal digital assistants and mobile phones), networkappliances, storage controllers (e.g., array, tape drive, or hard drivecontrollers), and the like. Computing system 1500 may include at leastone processor 130 and a memory 135. By executing the software thatimplements fencing module 175, computing system 1500 becomes a specialpurpose computing device that is configured to perform application-awareinput-output (I/O) fencing operations.

Processor 130 generally represents any type or form of processing unitcapable of processing data or interpreting and executing instructions.In certain embodiments, processor 130 may receive instructions from asoftware application or module. These instructions may cause processor130 to perform the functions of one or more of the embodiments describedand/or illustrated herein. For example, processor 130 may perform and/orbe a means for performing all or some of the operations describedherein. Processor 130 may also perform and/or be a means for performingany other operations, methods, or processes described and/or illustratedherein.

Memory 135 generally represents any type or form of volatile ornon-volatile storage devices or mediums capable of storing data and/orother computer-readable instructions. Examples include, withoutlimitation, random access memory (RAM), read only memory (ROM), flashmemory, or any other suitable memory device. Although not required, incertain embodiments computing system 1500 may include both a volatilememory unit and a non-volatile storage device. In one example, programinstructions implementing fencing module 175 may be loaded into memory135.

In certain embodiments, computing system 1500 may also include one ormore components or elements in addition to processor 130 and/or memory135. For example, as illustrated in FIG. 15, computing system 1500 mayinclude a memory controller 1520, an Input/Output (I/O) controller 1535,and a communication interface 1545, each of which may be interconnectedvia a communication infrastructure 1505. Communication infrastructure1505 generally represents any type or form of infrastructure capable offacilitating communication between one or more components of a computingdevice. Examples of communication infrastructure 1505 include, withoutlimitation, a communication bus (such as an Industry StandardArchitecture (ISA), Peripheral Component Interconnect (PCI), PCI express(PCIe), or similar bus) and a network.

Memory controller 1520 generally represents any type/form of devicecapable of handling memory or data or controlling communication betweenone or more components of computing system 1500. In certain embodimentsmemory controller 1520 may control communication between processor 130,memory 135, and I/O controller 1535 via communication infrastructure1505. In certain embodiments, memory controller 1520 may perform and/orbe a means for performing, either alone or in combination with otherelements, one or more of the operations or features described and/orillustrated herein.

I/O controller 1535 generally represents any type or form of module capable of coordinating and/or controlling the input and output functions of a virtual machine, an appliance, a gateway, a cluster, a node, and/or a computing system. For example, in certain embodiments I/O controller 1535 may control or facilitate transfer of data between one or more elements of cluster 120, coordination points 180(1)-(N), data disks 195(1)-(N), and/or nodes 125(1)-(N), such as processor 130, memory 135, communication interface 1545, display adapter 1515, input interface 1525, and storage interface 1540.

Communication interface 1545 broadly represents any type or form ofcommunication device or adapter capable of facilitating communicationbetween computing system 1500 and one or more other devices.Communication interface 1545 may facilitate communication betweencomputing system 1500 and a private or public network includingadditional computing systems. Examples of communication interface 1545include, without limitation, a wired network interface (such as anetwork interface card), a wireless network interface (such as awireless network interface card), a modem, and any other suitableinterface. Communication interface 1545 may provide a direct connectionto a remote server via a direct link to a network, such as the Internet,and may also indirectly provide such a connection through, for example,a local area network (e.g., an Ethernet network), a personal areanetwork, a telephone or cable network, a cellular telephone connection,a satellite data connection, or any other suitable connection.

Communication interface 1545 may also represent a host adapter configured to facilitate communication between computing system 1500 and one or more additional network or storage devices via an external bus or communications channel. Examples of host adapters include, without limitation, Small Computer System Interface (SCSI) host adapters, Universal Serial Bus (USB) host adapters, Institute of Electrical and Electronics Engineers (IEEE) 1394 host adapters, Serial Advanced Technology Attachment (SATA), Serial Attached SCSI (SAS), and external SATA (eSATA) host adapters, Advanced Technology Attachment (ATA) and Parallel ATA (PATA) host adapters, Fibre Channel interface adapters, Ethernet adapters, or the like. Communication interface 1545 may also allow computing system 1500 to engage in distributed or remote computing (e.g., by receiving/sending instructions to/from a remote device for execution).

As illustrated in FIG. 15, computing system 1500 may also include atleast one display device 1510 coupled to communication infrastructure1505 via a display adapter 1515. Display device 1510 generallyrepresents any type or form of device capable of visually displayinginformation forwarded by display adapter 1515. Similarly, displayadapter 1515 generally represents any type or form of device configuredto forward graphics, text, and other data from communicationinfrastructure 1505 (or from a frame buffer, as known in the art) fordisplay on display device 1510. Computing system 1500 may also includeat least one input device 1530 coupled to communication infrastructure1505 via an input interface 1525. Input device 1530 generally representsany type or form of input device capable of providing input, eithercomputer or human generated, to computing system 1500. Examples of inputdevice 1530 include a keyboard, a pointing device, a speech recognitiondevice, or any other input device.

Computing system 1500 may also include storage device 1550 coupled tocommunication infrastructure 1505 via a storage interface 1540. Storagedevice 1550 generally represents any type or form of storage devices ormediums capable of storing data and/or other computer-readableinstructions. For example, storage device 1550 may include a magneticdisk drive (e.g., a so-called hard drive), a floppy disk drive, amagnetic tape drive, an optical disk drive, a flash drive, or the like.Storage interface 1540 generally represents any type or form ofinterface or device for transferring and/or transmitting data betweenstorage device 1550, and other components of computing system 1500.

Storage device 1550 may be configured to read from and/or write to a removable storage unit configured to store computer software, data, or other computer-readable information. Examples of suitable removable storage units include a floppy disk, a magnetic tape, an optical disk, a flash memory device, or the like. Storage device 1550 may also include other similar structures or devices for allowing computer software, data, or other computer-readable instructions to be loaded into computing system 1500. For example, storage device 1550 may be configured to read and write software, data, or other computer-readable information. Storage device 1550 may also be a part of computing system 1500 or may be a separate device accessed through other interface systems.

Many other devices or subsystems may be connected to computing system1500. Conversely, all of the components and devices illustrated in FIG.15 need not be present to practice the embodiments described and/orillustrated herein, and the devices and subsystems referenced above mayalso be interconnected in different ways. Computing system 1500 may alsoemploy any number of software, firmware, and/or hardware configurations.Embodiments disclosed herein may be encoded as a computer program (alsoreferred to as computer software, software applications,computer-readable instructions, or computer control logic) on acomputer-readable storage medium. Examples of computer-readable storagemedia include magnetic-storage media (e.g., hard disk drives and floppydisks), optical-storage media (e.g., CD- or DVD-ROMs),electronic-storage media (e.g., solid-state drives and flash media), andthe like. Such computer programs can also be transferred to computingsystem 1500 for storage in memory via a network such as the Internet orupon a carrier medium.

The computer-readable medium containing the computer program may beloaded into computing system 1500 and/or nodes 125(1)-(N). All or aportion of the computer program stored on the computer-readable mediummay then be stored in memory 135 and/or various portions of storagedevice 1550. When executed by processor 130, a computer program loadedinto computing system 1500 may cause processor 130 to perform and/or bea means for performing the functions of one or more of the embodimentsdescribed and/or illustrated herein. Additionally or alternatively, oneor more of the embodiments described and/or illustrated herein may beimplemented in firmware and/or hardware. For example, computing system1500 may be configured as an application specific integrated circuit(ASIC) adapted to implement one or more of the embodiments disclosedherein.

An Example Networking Environment

FIG. 16 is a block diagram of a networked system, illustrating howvarious devices can communicate via a network, according to oneembodiment. In certain embodiments, network-attached storage (NAS)devices may be configured to communicate with nodes 125(1)-(N) incluster 120, and/or coordination points 180(1)-(N) using variousprotocols, such as Network File System (NFS), Server Message Block(SMB), or Common Internet File System (CIFS). Network 115 generallyrepresents any type or form of computer network or architecture capableof facilitating communication between nodes 125(1)-(N) in cluster 120,coordination points 180(1)-(N), and data disks 195(1)-(N). In certainembodiments, a communication interface, such as communication interface1545 in FIG. 15, may be used to provide connectivity between nodes125(1)-(N) in cluster 120, coordination points 180(1)-(N), data disks195(1)-(N), and network 115. It should be noted that the embodimentsdescribed and/or illustrated herein are not limited to the Internet orany particular network-based environment. For example, network 115 canbe a Storage Area Network (SAN).

In one embodiment, all or a portion of one or more of the disclosedembodiments may be encoded as a computer program and loaded onto andexecuted by nodes 125(1)-(N) and/or coordination points 180(1)-(N). Allor a portion of one or more of the embodiments disclosed herein may alsobe encoded as a computer program, stored on nodes 125(1)-(N) and/orcoordination points 180(1)-(N), and distributed over network 115. Insome examples, all or a portion of nodes 125(1)-(N), cluster 120, and/orcoordination points 180(1)-(N) may represent portions of acloud-computing or network-based environment. Cloud-computingenvironments may provide various services and applications via theInternet. These cloud-based services (e.g., software as a service,platform as a service, infrastructure as a service, etc.) may beaccessible through a web browser or other remote interface. Variousfunctions described herein may be provided through a remote desktopenvironment or any other cloud-based computing environment.

In addition, one or more of the components described herein maytransform data, physical devices, and/or representations of physicaldevices from one form to another. For example, fencing module 175 maytransform the behavior of nodes 125(1)-(N) in order to cause nodes125(1)-(N) to perform application-aware I/O fencing operations.

Although the present disclosure has been described in connection withseveral embodiments, the disclosure is not intended to be limited to thespecific forms set forth herein. On the contrary, it is intended tocover such alternatives, modifications, and equivalents as can bereasonably included within the scope of the disclosure as defined by theappended claims.

What is claimed is:
 1. A method comprising: performing an applicationfencing operation, comprising installing an identifier on one or morecoordination points, wherein the identifier associates an instance ofthe application with a node on which the instance of the application isexecuting; determining a weight assigned to the instance of theapplication; and terminating the instance of the application based, atleast in part, on the weight.
 2. The method of claim 1, furthercomprising: as part of performing the application fencing operation,causing termination of the instance of the application instead of thenode on which the instance of the application is executing.
 3. Themethod of claim 1, further comprising: accessing an application weightmatrix, wherein the application weight matrix comprises the weightassigned to the instance of the application.
 4. The method of claim 3,further comprising: receiving the application weight matrix; andgenerating the identifier for the instance of the application, whereinthe identifier is a registration key; and as part of the installing,storing the registration key on one or more coordination points.
 5. Themethod of claim 4, wherein the one or more coordination points comprise,at least one of one or more coordinator disks, or one or morecoordination point servers.
 6. The method of claim 4, furthercomprising: determining whether a cluster is partitioned into aplurality of network partitions; accessing the application weightmatrix; and performing a partition arbitration process using theapplication weight matrix.
 7. The method of claim 6, wherein theperforming the partition arbitration process comprises: performing afencing race to determine a winner partition and one or more loserpartitions of the plurality of network partitions.
 8. The method ofclaim 7, further comprising: excluding the instance of the applicationand one or more other instances of the application from the fencingrace, if both the instance of the application and the one or more otherinstances of the application execute on two or more nodes of theplurality of nodes that are part of a same network partition of theplurality of network partitions; and including the instance of theapplication and the one or more other instances of the application inthe fencing race, if the application and the one or more other instancesof the application execute on separate network partitions of theplurality of network partitions.
 9. The method of claim 7, wherein thefencing race to determine the winner partition and the one or more loserpartitions is decided based on information in the application weightmatrix.
 10. The method of claim 9, further comprising: as part of the fencing race, uninstalling the registration key for the instance of the application from a coordination point of the one or more coordination points based on the weight assigned to the instance of the application.
 11. A non-transitory computer-readable storage medium (CRM) storing program instructions executable to: perform an application fencing operation, comprising installing an identifier on one or more coordination points, wherein the identifier associates an instance of an application with a node on which the instance of the application is executing; determining a weight assigned to the instance of the application; and terminating the instance of the application based, at least in part, on the weight.
 12. The non-transitory CRM of claim 11,further comprising: as part of performing the application fencingoperation, causing termination of the instance of the applicationinstead of the node on which the instance of the application isexecuting.
 13. The non-transitory CRM of claim 11, further comprising:accessing an application weight matrix, wherein the application weightmatrix comprises the weight assigned to the instance of the application;receiving the application weight matrix; generating the identifier forthe instance of the application, wherein the identifier is aregistration key; and as part of the installing, storing theregistration key on one or more coordination points.
 14. Thenon-transitory CRM of claim 13, wherein the one or more coordinationpoints comprise, at least one of one or more coordinator disks, or oneor more coordination point servers.
 15. The non-transitory CRM of claim13, further comprising: determining whether a cluster is partitionedinto a plurality of network partitions; accessing the application weightmatrix; performing a partition arbitration process using the applicationweight matrix, wherein the performing the partition arbitration processcomprises performing a fencing race to determine a winner partition andone or more loser partitions of the plurality of network partitions,wherein the fencing race to determine the winner partition and the oneor more loser partitions is decided based on information in theapplication weight matrix; and as part of the fencing race, uninstallingthe registration key for the instance of the application from acoordination point of the one or more coordination points based on theweight assigned to the instance of the application; excluding theinstance of the application and one or more other instances of theapplication from the fencing race, if both the instance of theapplication and the one or more other instances of the applicationexecute on two or more nodes of the plurality of nodes that are part ofa same network partition of the plurality of network partitions; andincluding the instance of the application and the one or more otherinstances of the application in the fencing race, if the application andthe one or more other instances of the application execute on separatenetwork partitions of the plurality of network partitions.
 16. A systemcomprising: one or more processors; and a memory coupled to the one ormore processors, wherein the memory stores program instructionsexecutable by the one or more processors to: perform an applicationfencing operation, comprising installing an identifier on one or morecoordination points, wherein the identifier associates an instance of anapplication with a node on which the instance of the application isexecuting; determining a weight assigned to the instance of theapplication; and terminating the instance of the application based, atleast in part, on the weight.
 17. The system of claim 16, furthercomprising: as part of performing the application fencing operation,causing termination of the instance of the application instead of thenode on which the instance of the application is executing.
 18. Thesystem of claim 16, further comprising: accessing an application weightmatrix, wherein the application weight matrix comprises the weightassigned to the instance of the application; receiving the applicationweight matrix; and generating the identifier for the instance of theapplication, wherein the identifier is a registration key; and as partof the installing, storing the registration key on one or morecoordination points.
 19. The system of claim 18, wherein the one or morecoordination points comprise, at least one of one or more coordinatordisks, or one or more coordination point servers.
 20. The system of claim 18, further comprising: determining whether a cluster is partitioned into a plurality of network partitions; accessing the application weight matrix; and performing a partition arbitration process using the application weight matrix.
 21. The system of claim 20,wherein the performing the partition arbitration process comprisesperforming a fencing race to determine a winner partition and one ormore loser partitions of the plurality of network partitions, whereinthe fencing race to determine the winner partition and the one or moreloser partitions is decided based on information in the applicationweight matrix; and as part of the fencing race, uninstalling theregistration key for the instance of the application from a coordinationpoint of the one or more coordination points based on the weightassigned to the instance of the application.
 22. The system of claim 21,further comprising: excluding the instance of the application and one ormore other instances of the application from the fencing race, if boththe instance of the application and the one or more other instances ofthe application execute on two or more nodes of the plurality of nodesthat are part of a same network partition of the plurality of networkpartitions; and including the instance of the application and the one ormore other instances of the application in the fencing race, if theapplication and the one or more other instances of the applicationexecute on separate network partitions of the plurality of networkpartitions.